Abstract

This paper presents a scalable method to ef- 
ficiently search for the most likely state trajectory 
leading to an event given only a simulator of a system. 
Our approach uses a reinforcement learning formu- 
lation and solves it using Monte Carlo Tree Search 
(MCTS). The approach places very few requirements 
on the underlying system, requiring only that the sim- 
ulator provide some basic controls, the ability to eval- 
uate certain conditions, and a mechanism to control 
the stochasticity in the system. Access to the system 
state is not required, allowing the method to support 
systems with hidden state. The method is applied 
to stress test a prototype aircraft collision avoidance 
system to identify trajectories that are likely to lead 
to near mid-air collisions. We present results for both 
single and multi-threat encounters and discuss their 
relevance. Compared with direct Monte Carlo search, 
this MCTS method performs significantly better both 
in finding events and in maximizing their likelihood. 

Introduction 

Airborne collision avoidance systems are mandated
worldwide on large transport and cargo aircraft
to help prevent mid-air collision. Since it entered 
operation in 1993, the Traffic Alert and Collision 
Avoidance System (TCAS) has played a crucial role 
in greatly reducing the risk of mid-air collision [1].
With air traffic projected to double in the next 30 
years [3], the Federal Aviation Administration (FAA) 
has decided to develop a new aircraft collision avoid- 
ance system capable of addressing the growing needs 
of the national airspace. The next-generation Airborne 
Collision Avoidance System (ACAS X) is currently 
being developed and tested by the FAA and promises 
a number of potential improvements including further
reduction in collision risk while simultaneously
reducing the number of unnecessary alerts [4]. ACAS X
uses a partially observable Markov decision process
to model the problem and dynamic programming to
efficiently compute its solution [4].

Prior to certification, an aircraft collision avoid- 
ance system must undergo extensive verification and 
validation. A variety of different metrics are used to
evaluate the safety, suitability, and acceptability of the 
system [19]. One of the primary safety metrics is the 
likelihood of near mid-air collision (NMAC), defined 
as being when two aircraft come less than 100 feet 
vertically and 500 feet horizontally from one another. 
It is well accepted that the likelihood of NMAC 
cannot be driven to zero due to surveillance noise, 
pilot response delay, and the need for an acceptable 
alert rate. However, it is still important to understand 
the situations where NMACs can arise, even if they 
are extremely rare.

There have been a variety of both “white box” 
and “black box” methods applied to the analysis of 
NMAC events. White box methods assume that the in- 
ner details of the system are available to be inspected 
and leveraged in the analysis. For example, Von Essen 
and Giannakopoulou [16] used probabilistic model- 
checking to analyze a simplified version of ACAS X, 
and Jeannin, Ghorbal, Kouskoulas, et al. [18] used an 
automated theorem prover. Unfortunately, white box
methods generally have to rely on approximate rep- 
resentations of the system because of their difficulty 
scaling to the many internal variables governing the 
state of the system. 

In contrast with white box methods, black box 
methods do not use information about the internal 
details of the system. Consequently, they can scale to 
more complicated systems like TCAS and ACAS X 
that have many variables. All that is required in a
black box approach is the ability to simulate the
system. Prior analysis of TCAS involved sweeping 
through a low-dimensional parametric model of head- 
on encounters and simulating the collision avoidance 
system [20]. Although this kind of stress testing can 
find NMACs, it is limited to the low-dimensional 
parameterization of the encounter space and it can 
be difficult to assess the likelihood of the various 
NMACs. An alternative approach is to run simulations 
drawn from a statistical representation of the airspace 
[17]. Unfortunately, directly sampling from such a 
model is very computationally inefficient due to the 
rarity of NMACs. 

This paper presents a black box method for 
adaptive stress testing that aims to find the most 
likely scenarios that lead to NMAC. The approach is 
related to reinforcement learning, an area of machine 
learning concerned with making decisions in an un- 
known environment so as to maximize a numerical 
reward [6]. Our problem differs from a traditional 
reinforcement learning problem in that we operate 
on non-Markovian systems and optimize over state 
transitions rather than an explicit set of actions. 

The proposed methodology is quite general and 
is applicable to the adaptive stress testing of a variety 
of different systems. This paper first presents the gen- 
eral methodology and then applies it to the analysis 
of the official binaries encoding the decision logic of 
an ACAS X prototype.

Problem Description 

We describe the general problem of finding the
most likely state trajectory that leads to a critical event
$E$ where only a generative black box simulator $\mathcal{S}$
is available. We assume the system underlying the
simulator to be discrete-time Markovian with stochastic
transitions. However, the state of the system is
hidden, making the process non-Markovian from the
search algorithm’s perspective. We use $s_t$ to denote
the hidden internal state of the simulator $\mathcal{S}$ at time
$t$.

We specify the inputs to the problem by a pair
$(\mathcal{S}, E)$, where $\mathcal{S}$ is a generative black box simulator
and $E$ is a subset of the state space where the event
of interest occurs. The black box simulator exposes
the system of interest as a discrete-time Markov
chain with seeded stochastic transitions. Specifically,
the simulator steps through time, drawing a random
next state at each time step and updating its internal
state, where the nominal randomness for the sampling
is pseudorandomly generated from a provided seed.
The seed can be the pseudorandom seed or state of
the pseudorandom number generator. The simulator
exposes three functions for simulation control:

• INITIALIZE($\mathcal{S}$) resets the simulator $\mathcal{S}$ to its
initial state $s_0$.

• STEP($\mathcal{S}$, $a_t$) updates the hidden state of the
simulator $\mathcal{S}$ by pseudorandomly sampling a next
state $s_{t+1}$ given the current state $s_t$ according to
the system transition probability $P(s_{t+1} \mid s_t)$ and
the seed $a_t$. The function returns the probability
of the transition taken and a boolean that indicates
whether $s_t$ is in $E$. Both INITIALIZE and
STEP transform $\mathcal{S}$ in place.

• ISTERMINAL($\mathcal{S}$) returns true if the current state
of the simulator is terminal, and false otherwise.
We define the terminal time $T$ to be the first time
at which ISTERMINAL returns true.

The problem is defined as follows: Given
$(\mathcal{S}, E)$, find the trajectory that contains the
occurrence of $E$ and maximizes the likelihood of the
trajectory $\prod_{t=0}^{T} P(s_{t+1} \mid s_t)$.
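
The three simulator controls can be illustrated with a minimal Python sketch. The class `ToySimulator`, its random-walk dynamics, the event "position reaches +3", and the helper `trajectory_log_likelihood` are all hypothetical and chosen only for illustration; the interface itself (reset, seeded step returning a probability and an event flag, terminal check) follows the description above.

```python
import math
import random


class ToySimulator:
    """Hypothetical seeded black box: a hidden random walk on the integers.

    The event E is "hidden position reaches +3"; the episode also ends
    after max_steps transitions. These details are assumptions made for
    this example only.
    """

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.initialize()

    def initialize(self):
        """INITIALIZE: reset the hidden state to s_0."""
        self._pos = 0
        self._t = 0

    def step(self, seed):
        """STEP: sample the next hidden state using only the given seed.

        Returns (transition probability, whether the event has occurred).
        """
        rng = random.Random(seed)
        # Move up with probability 0.3, down with probability 0.7.
        if rng.random() < 0.3:
            self._pos += 1
            prob = 0.3
        else:
            self._pos -= 1
            prob = 0.7
        self._t += 1
        return prob, self._pos >= 3

    def is_terminal(self):
        """ISTERMINAL: true once the event occurs or time runs out."""
        return self._pos >= 3 or self._t >= self.max_steps


def trajectory_log_likelihood(sim, seeds):
    """Replay a seed sequence and accumulate log P(s_{t+1} | s_t)."""
    sim.initialize()
    total, event = 0.0, False
    for seed in seeds:
        if sim.is_terminal():
            break
        prob, event = sim.step(seed)
        total += math.log(prob)
    return total, event
```

Because every transition is driven by the supplied seed, replaying the same seed sequence reproduces the same trajectory and the same log likelihood, without the caller ever reading the hidden state.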

Proposed Method 

The overall strategy of our proposed solution is 
to recast the given problem into a decision-making 
problem and apply a variation of a tree-based re- 
inforcement learning algorithm called Monte Carlo 
Tree Search (MCTS). A system diagram is shown in 
Figure 1. At the center of the method is the system 
under test modeled as a black box simulator. It takes 
as input basic simulator controls (i.e., INITIALIZE, 
STEP, and ISTERMINAL) and a seed, and outputs the 
likelihood of the current transition and a boolean 
indicating whether the current state is an event. The 
outputs are transformed into the reward using the 
reward function and passed to the MCTS algorithm. 
Finally, to complete the loop, MCTS uses the reward 
to choose the seed and control inputs of the simulator. 
We describe each component of the system in detail. 

Reformulation as a Decision Process 

We take the stochastic process defined by $\mathcal{S}$ and
insert decision points between each time step where
each decision is to choose the seed $a_t$. Recall that
$a_t$ controls the stochastic transition of the system by
specifying a pseudorandomly generated sample of the
next state in STEP. As a result, rather than allowing
the system to evolve naturally (and stochastically), the
sequence of decisions uniquely determines the state
evolution of the system.

Figure 1. System diagram of stress testing method

Reward Function 

We design a reward function for the decision
process that is equivalent to the objective in the
original problem, which is to find trajectories that
contain events and maximize the likelihood of the
trajectory $\prod_{t=0}^{T} P(s_{t+1} \mid s_t)$. Since reinforcement learning
maximizes the expected sum of rewards (as opposed
to product), we choose the following reward function:

$$
R(s_t, s_{t+1}) =
\begin{cases}
0 & \text{if } s_t \in E, \\
-\infty & \text{if } s_t \notin E,\ t \ge T, \\
\log P(s_{t+1} \mid s_t) & \text{if } s_t \notin E,\ t < T.
\end{cases}
\tag{1}
$$

Throughout this discussion, we assume, without 
loss of generality, that terminal states are absorbing. 
An absorbing state is one that transitions to itself with 
probability 1. After initially collecting the reward for 
being in that state, subsequent transitions give rewards 
of 0. Recall that the terminal time T is the first time 
at which the state becomes terminal. As a result, we 
can use the shorthand notation $t \ge T$ to indicate a
terminal state, and $t < T$ to indicate a non-terminal
one. Furthermore, we assume that states in $E$ are
terminal, so that $s_t \in E$ implies $t \ge T$.

The first two conditions implement the event
occurrence constraint. If the trajectory terminates and
$E$ occurs, the first condition awards a maximum
reward of 0. However, if the trajectory terminates
and $E$ does not occur, then the second condition
infinitely penalizes the learner. The third condition
in the reward function maximizes the trajectory
likelihood by giving a non-positive reward $\log P(s_{t+1} \mid s_t)$ for
each non-terminal transition. By choosing a reward
of $\log P(s_{t+1} \mid s_t)$, we maximize $\sum_{t=0}^{T} \log P(s_{t+1} \mid s_t)$
in the reinforcement learning problem, which, because
the logarithm is monotonically increasing, is
equivalent to maximizing $\prod_{t=0}^{T} P(s_{t+1} \mid s_t)$ in the
original problem.

We observe that each term in the reward function
is non-positive. As a result, the total reward of any
trajectory, $\sum_{t=0}^{T} R(s_t, s_{t+1})$, will be non-positive as
well. In fact, if $E$ occurs, then the total reward is the
log likelihood of the trajectory. Otherwise, the total
reward is $-\infty$.
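
A short Python sketch makes the three cases of the reward function concrete. The function names `reward` and `total_reward` and their argument layout are assumptions made for this example, not part of the method's specification.

```python
import math


def reward(prob, in_event, is_terminal):
    """Reward of eq. 1 for one transition (a sketch).

    prob        -- P(s_{t+1} | s_t) reported by the simulator
    in_event    -- True if s_t is in E
    is_terminal -- True if t >= T
    """
    if in_event:
        return 0.0
    if is_terminal:
        return -math.inf
    return math.log(prob)


def total_reward(probs, event_at_end):
    """Sum rewards along a trajectory whose last state is terminal."""
    total = sum(reward(p, False, False) for p in probs)
    return total + reward(1.0, event_at_end, True)
```

For transition probabilities 0.5, 0.25, and 0.1 ending in an event, `total_reward` equals the log of their product, 0.0125, up to floating-point rounding; the same trajectory without an event yields negative infinity.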

Monte Carlo Tree Search 

One of the most successful sampling-based on- 
line approaches to reinforcement learning in recent 
years is Monte Carlo Tree Search (MCTS). The 
approach has become widely known because of recent 
successes in the game of computer Go [10]. MCTS 
incrementally builds a search tree using sampling and 
forward simulation to inform the search and focus on 
the most promising areas. Here, we describe a varia- 
tion of MCTS that uses double progressive widening, 
a technique that controls the branching factor of 
the search tree [11]. This variation is specifically
necessary when the action space is continuous or so 
large that all actions cannot possibly be explored. The 
latter applies in our case since the actions are in the 
space of pseudorandom seeds, which is too large to 
exhaustively explore. For a detailed review of MCTS, 
see [10]. 

The algorithm involves running many simula- 
tions from the current state while updating an estimate 
of the state-action value function. The state-action 
value function Q(s,a) represents the expected sum 
of rewards resulting from choosing action a in state 
s. The following is an overview of the three stages of 
each simulation: 

• Search. In the search stage, the algorithm starts
at the root of the tree and recursively selects a
child to follow. At each visited state node, the
first progressive widening criterion determines
whether to choose amongst existing actions, or
to generate a new one. The criterion limits the
number of actions at a state $s$ to be no more
than polynomial in the total number of visits to
that state. Specifically, a new action is generated
according to a user-defined function GetAction
if $\|A(s)\| < kN(s)^{\alpha}$, where $k$ and $\alpha$ are
parameters, $\|A(s)\|$ is the number of existing
actions at state $s$, and $N(s)$ is the total number
of visits to state $s$. Otherwise, the existing action
that maximizes

$$
Q(s,a) + c\sqrt{\frac{\log N(s)}{N(s,a)}} \tag{2}
$$

is chosen, where $c$ is a parameter that controls the
amount of exploration in the search, and $N(s,a)$
is the total number of visits to action $a$ in state $s$.
The second term in eq. 2 is an exploration bonus
that encourages selecting actions that have not
been tried as frequently.

At each visited action node, the second progressive
widening criterion determines whether
to follow an existing next state node, or to
draw a new next state from the simulator. The
second criterion has the same form as the first
and limits the number of next states to be no
more than polynomial in the number of visits
to that state-action pair. We draw a new next
state if $\|V(s,a)\| < k'N(s,a)^{\alpha'}$, where $k'$ and $\alpha'$
are parameters, $\|V(s,a)\|$ is the number of next
states visited from the state-action pair $(s,a)$, and
$N(s,a)$ is the total number of visits to $(s,a)$.
Otherwise, we randomly select a next state with
probability proportional to $N(s,a,s')$, the number
of times $s'$ was encountered choosing action $a$ in
state $s$. The search stage continues in this manner
until the system transitions to a state that is not
in the tree.

• Expansion. Once we have reached a state that
is not in the tree, we create a new node for the
state and add it. The set of actions taken from
this state, $A(s)$, is initialized to empty.

• Rollout. Starting from the state created in the
expansion stage, we perform a rollout that repeatedly
samples state transitions until the desired
depth (or termination) is reached. State transitions
are drawn from the simulator with actions
chosen according to a rollout (or default) policy
$\pi_0$. The total reward of the sampled trajectory is
returned and used to update the value for $Q(s,a)$
used by the search phase.
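
The two decisions made at a state node in the search stage, whether to add a new action and which existing action to follow, can be sketched as below. The function names and the dictionary-based bookkeeping are assumptions for illustration; treating never-visited actions as having an infinite score is also an assumption, added here to avoid dividing by zero.

```python
import math


def should_widen(num_children, num_visits, k, alpha):
    """Progressive widening test: add a child while |children| < k * N^alpha."""
    return num_children < k * num_visits ** alpha


def ucb_select(q, n_sa, n_s, c):
    """Return the action maximizing Q(s,a) + c * sqrt(log N(s) / N(s,a)).

    q and n_sa map each existing action to its value estimate and visit
    count; actions that have never been visited are tried first.
    """
    def score(a):
        if n_sa[a] == 0:
            return math.inf
        return q[a] + c * math.sqrt(math.log(n_s) / n_sa[a])
    return max(q, key=score)
```

With a large exploration constant, a rarely visited action with a lower value estimate can outrank a frequently visited one; with `c = 0` the choice reduces to the greedy action.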

Simulations are run until some stopping criterion
is met, often simply a fixed number of iterations. We 
then execute the action that maximizes Q(s, a). Once 
that action has been executed, we can rerun the MCTS 
to select the next action. The process is repeated until 
all actions are executed. 

We now describe the specific choices of parameters
and modifications we have made to tailor the algorithm for
our purposes. Since we have a deterministic transition
given the state and seed, there is no need to consider
multiple samples of the next state $s'$. As a result, we
disable the second progressive widening by setting
$k' = 1$ and $\alpha' = 0$. The choice of these parameters
makes $\|V(s,a)\| < k'N(s,a)^{\alpha'}$ true only once, when a
sample $s'$ is drawn, then false thereafter independent
of the value of $N(s,a)$. Since there is no reason to
distinguish among pseudorandom seeds, we choose
$a \sim \pi_0$ and GetAction to sample uniformly from
all available seed values. The choice of a uniform
$\pi_0$ generates unbiased samples of the next state from
our simulator. We initialize all state-action values to
$Q_0(s,a) = 0$.

MCTS is designed for stochastic Markovian systems,
and thus does not generally support systems that
are non-Markovian. The problem is that the algorithm
needs the ability to set the system state back to any
visited state of the tree, and there is no general way
to do that in the stochastic setting without an explicit
state representation. Fortunately, since each transition
from $s_t$ to $s_{t+1}$ is deterministic given action $a_t$, we
can return to any visited state $s_t$ by remembering the
sequence of actions $a_{0:t-1} = [a_0, \ldots, a_{t-1}]$ taken since
$s_0$. To revisit $s_t$, we first call INITIALIZE to return
to $s_0$, then repeatedly call STEP with the sequence of
actions $a_{0:t-1}$. When an explicit representation of $s_t$ is
used, we implicitly make the substitution $s_t = a_{0:t-1}$,
reusing (and abusing) the notation for $s_t$. This key
modification to the algorithm enables our method to
support Markovian systems with hidden state. The
process is outlined in Algorithms 1 and 2.
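
The replay idea can be demonstrated with a small Python sketch. `CounterSim` is a hypothetical stand-in for a simulator with hidden state; its `fingerprint` method exists only so the example can check that the replay lands on the same state, and is not something the method requires.

```python
import random


class CounterSim:
    """Hypothetical simulator whose hidden state cannot be read or set."""

    def initialize(self):
        self._hidden = 0

    def step(self, seed):
        # The seed fully determines the transition from the current state.
        self._hidden += random.Random(seed).choice([-1, 1])

    def fingerprint(self):
        """Exposed only so this example can verify the replay."""
        return self._hidden


def go_to_state(sim, action_history):
    """Restore the state identified by the seed sequence a_{0:t-1}."""
    sim.initialize()
    for seed in action_history:
        sim.step(seed)
```

A tree node can therefore be keyed by its tuple of seeds, for example `(11, 22, 33)`, and revisited at any time at the cost of re-simulating from the initial state.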

Figure 2 shows a search tree for one decision
in a single threat encounter. One thousand iterations 
are shown. The figure shows how MCTS explores 
broadly, but also focuses deeply on a number of 
potentially high-reward areas. 



Algorithm 1 Tailored Monte Carlo tree search with
double progressive widening

function MonteCarloTreeSearch($\mathcal{S}$, $s$, $d$)
    loop
        GoToState($\mathcal{S}$, $s$)
        Simulate($\mathcal{S}$, $s$, $d$)
    return $\arg\max_a Q(s,a)$

function Simulate($\mathcal{S}$, $s$, $d$)
    if $d = 0$ then
        return 0
    if $s \notin \mathcal{T}$ then
        $\mathcal{T} \leftarrow \mathcal{T} \cup \{s\}$
        $(N(s), A(s)) \leftarrow (0, \emptyset)$
        return Rollout($\mathcal{S}$, $s$, $d$)
    $N(s) \leftarrow N(s) + 1$
    if $\|A(s)\| < kN(s)^{\alpha}$ then
        $a \leftarrow$ GetAction()
        $(N(s,a), V(s,a)) \leftarrow (0, \emptyset)$
        $Q(s,a) \leftarrow Q_0(s,a)$
        $A(s) \leftarrow A(s) \cup \{a\}$
    $a \leftarrow \arg\max_a Q(s,a) + c\sqrt{\log N(s) / N(s,a)}$
    if $\|V(s,a)\| < k'N(s,a)^{\alpha'}$ then
        $(p, e) \leftarrow$ STEP($\mathcal{S}$, $a$)
        $r \leftarrow$ GetReward($p$, $e$)
        $s' \leftarrow [s, a]$
        if $s' \notin V(s,a)$ then
            $V(s,a) \leftarrow V(s,a) \cup \{s'\}$
            $R(s,a,s') \leftarrow r$
            $N(s,a,s') \leftarrow 0$
        else
            $N(s,a,s') \leftarrow N(s,a,s') + 1$
    else
        $n \leftarrow \sum_{s'} N(s,a,s')$
        $s' \leftarrow$ Sample($V(s,a)$, $N(s,a,\cdot)/n$)
        $r \leftarrow R(s,a,s')$
        $N(s,a,s') \leftarrow N(s,a,s') + 1$
    $q \leftarrow r + \gamma\,$Simulate($\mathcal{S}$, $s'$, $d - 1$)
    $N(s,a) \leftarrow N(s,a) + 1$
    $Q(s,a) \leftarrow Q(s,a) + \frac{q - Q(s,a)}{N(s,a)}$
    return $q$

function Rollout($\mathcal{S}$, $s$, $d$)
    if $d = 0$ then
        return 0
    $a \sim \pi_0$
    $(p, e) \leftarrow$ STEP($\mathcal{S}$, $a$)
    $r \leftarrow$ GetReward($p$, $e$)
    $s' \leftarrow [s, a]$
    return $r + \gamma\,$Rollout($\mathcal{S}$, $s'$, $d - 1$)


Algorithm 2 MCTS auxiliary function

function GoToState($\mathcal{S}$, $s$)
    $a_{0:t-1} \leftarrow s$
    INITIALIZE($\mathcal{S}$)
    for each $a$ in $a_{0:t-1}$ do
        STEP($\mathcal{S}$, $a$)



Figure 2. Visualization of the MCTS search tree 
for one decision point (1000 iterations shown) 

Aircraft Collision Avoidance 

ACAS X is targeted to replace TCAS as the stan- 
dard collision avoidance system for transport aircraft 
worldwide. Before widespread acceptance, the safety 
of the system must be demonstrated. Monte Carlo 
evaluations of the system in realistic encounters play 
an essential role in system assessments. Safety eval- 
uations are performed using validated noise models 
and incorporate altimetry bias effects. One important 
safety metric is the risk ratio, defined as the proba- 
bility of an NMAC with a collision avoidance system 
divided by the probability of an NMAC without one. 
Figure 3 shows the estimated risk ratios of NMAC for 
TCAS and ACAS X based on 1.5 million encounters 
generated by the Lincoln Laboratory Correlated Air- 
craft Encounter Model (LLCEM) [12] [13] with both 
aircraft equipped with a collision avoidance system. 
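
Written as a formula, with the conditioning labels introduced here only for illustration, the risk ratio defined above is

$$
\text{risk ratio} = \frac{P(\text{NMAC} \mid \text{with collision avoidance system})}{P(\text{NMAC} \mid \text{without collision avoidance system})},
$$

so a value below 1 indicates that the system lowers the probability of an NMAC, and a smaller value indicates a larger safety benefit.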

As shown in Figure 3, ACAS X provides a substantial
safety benefit relative to TCAS while simultaneously
reducing the alert rate. Although there are
many other operational considerations that are important
to system acceptance, this paper focuses on better
understanding the remaining NMAC risk associated
with ACAS X. While 1.5 million encounters provide
insights into their relative safety, characterizing
encounters that are still risk-bearing is important to
system development. Using conventional methods to
produce a meaningful set of NMAC examples would
require an extremely large number of simulations.
This work shows how the search can be performed
much more effectively using a reinforcement learning
approach. Insight gained from this research will
inform the development process.

Figure 3. ACAS X and TCAS metrics.

We stress test a prototype of ACAS X following 
the method described earlier by constructing a simu- 
lation environment, recasting it as a decision process, 
and solving it using MCTS. Insights gained from this 
work will inform design decisions for future iterations 
of the collision avoidance system. An overall diagram 
of our approach is shown in Figure 4. 

Encounter Simulation 

At the core of the analysis is a simulation 
that models the various components of a mid-air 
encounter. We implement the components using the 
SISLES.jl framework [14]. Our simulation includes 
an encounter model, sensor model, collision avoid- 
ance system, pilot model, and an aircraft dynamics 
model. We describe each component in detail. 

Encounter Model: Encounter models are used to 
initialize encounters in such a way as to be both realistic
and likely to lead to NMACs. Initial states of aircraft, 
including positions, velocities, and headings are set 
in this manner. Pairwise (two-aircraft) encounters use 
the Lincoln Laboratory Correlated Aircraft Encounter 
Model (LLCEM), and multi-threat (three-aircraft) en- 
counters use the Star Encounter Model. 

LLCEM comprises two parts. The first is
a Bayesian network that models the geometry of two
aircraft at point of closest approach, and the second
is a dynamic Bayesian network that models how pilot
commands transition over time. Both of these models
are learned from a large body of radar data over
the entire national airspace [12]. To obtain a set of
encounter initial conditions from the two models, we
follow the simulation and transformation procedure
in that paper.

Figure 4. System diagram for stress testing pairwise
ACAS X encounters

The Star Encounter Model initializes aircraft on 
a circle heading towards the origin and at equal angles 
apart. Initial airspeed, altitude, and vertical rate are 
sampled from a uniform distribution. The horizontal 
distance from the origin is set such that without 
intervention, the aircraft crosses the origin at a preset 
time. For our experiments, we set the crossing time 
to 40 seconds. 
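
The geometry of this initialization can be sketched in Python as follows. The function name `star_encounter`, the sampling ranges, the units, and the treatment of airspeed as horizontal speed are all assumptions made for illustration; only the equal angular spacing, the heading toward the origin, the uniform sampling, and the distance chosen to match a preset crossing time come from the description above.

```python
import math
import random


def star_encounter(num_aircraft, crossing_time_s, rng,
                   speed_range=(50.0, 150.0),        # m/s, assumed
                   altitude_range=(1000.0, 3000.0),  # m, assumed
                   vrate_range=(-5.0, 5.0)):         # m/s, assumed
    """Place aircraft on a star pattern converging on the origin."""
    aircraft = []
    for i in range(num_aircraft):
        angle = 2.0 * math.pi * i / num_aircraft   # equal angular spacing
        speed = rng.uniform(*speed_range)
        radius = speed * crossing_time_s           # reach the origin on time
        aircraft.append({
            "x": radius * math.cos(angle),
            "y": radius * math.sin(angle),
            "heading": angle + math.pi,            # pointing at the origin
            "speed": speed,
            "altitude": rng.uniform(*altitude_range),
            "vertical_rate": rng.uniform(*vrate_range),
        })
    return aircraft
```

Calling `star_encounter(3, 40.0, random.Random(0))` produces three aircraft spaced 120 degrees apart, each of which would pass over the origin after 40 seconds if it held its heading and speed.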

Sensor Model: The sensor model captures how 
the collision avoidance system perceives the world. 
We assume active, beacon-based sensor capability 
with no noise. The main sensor inputs from own 
aircraft are $O_0 = \{\dot{z}, z, \psi, h\}$, where

• $\dot{z}$ is vertical rate

• $z$ is barometric altitude

• $\psi$ is the heading

• $h$ is height above ground

For each intruding aircraft $i$, the main sensor
inputs are $O_i = \{r_s, \chi, z\}$, where

• $r_s$ is slant range (relative distance to intruder)

• $\chi$ is bearing (relative angle to intruder)

• $z$ is altitude

ACAS X System: ACAS X is the system under 
test. We use an official binary of a prototype obtained 
from the FAA. The binary exposes the functions 
INITIALIZE and STEP, but is otherwise a black box. 
It is known that ACAS X maintains internal state, 
but the state is not exposed. Although there are many
output variables of the ACAS X system, the pilot is 
mostly concerned with the issued resolution advisory 
(RA), which instructs the pilots to climb or descend 
at a particular rate. 

ACAS X has a coordination mechanism to ensure
that issued RAs do not conflict with each other,
for example by advising two aircraft to maneuver in
the same vertical direction. The messages are
communicated to all nearby aircraft through coordination
codes.

Pilot Model: There are two aspects to modeling
the pilot. The first is the intended commands, which are
what the pilot would have done if there were no
conflicts. The second is the pilot response model, which
describes how the pilot responds to an RA.

For the intended commands, we use the probabilistic
model given by the dynamic Bayesian network
in LLCEM. Samples are drawn according to the
transition probabilities given. For the response model, we
use the deterministic pilot response model described
in [15]. The model assumes that pilots respond to
initial RAs with a 5-second delay then accelerate
towards the recommended target rate $\dot{h}_{\text{target}}$. Pilots
respond to subsequent RAs (i.e., strengthenings and
reversals) in the same manner except with a 3-second
delay. During the delay period, the pilot continues to
execute actions associated with the previous RA. That
is, the pilot continues to fly the intended trajectory
during initial RA delays, and continues responding
to the previous RA during subsequent RA delays.
Multiple RAs issued successively are queued so that
both their order and timing are maintained. In the case
where a subsequent RA is issued within 2 seconds or
less of an initial RA, the timing of the subsequent RA
is used and the initial RA is skipped. The output of
the pilot model is a command given by $\{a, \dot{h}_d, \dot{\psi}_d\}$,
where

• $a$ is commanded airspeed acceleration

• $\dot{h}_d$ is commanded vertical rate

• $\dot{\psi}_d$ is commanded turn rate

Aircraft Dynamics Model: The aircraft dynamics 
model determines how the state of the aircraft propa- 
gates given the pilot commands. The aircraft state is 
given by $x = \{v, N, E, z, \psi, \theta, \phi\}$, where

• $v$ is airspeed

• $N$ is position north

• $E$ is position east

• $z$ is altitude

• $\psi$ is heading angle

• $\theta$ is pitch angle

• $\phi$ is roll angle

We use forward Euler integration at 1 Hz to 
propagate the aircraft state. 
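
As a rough illustration of forward Euler propagation at a 1-second step, the sketch below advances a simplified point-mass aircraft. The kinematic equations, the omission of pitch and roll, the dictionary layout, and the name `euler_step` are all assumptions made for this example and are not the dynamics model used in the simulation.

```python
import math


def euler_step(state, command, dt=1.0):
    """Advance a simplified point-mass aircraft by one forward Euler step.

    state   -- dict with v (airspeed), N, E, z (position), psi (heading, rad)
    command -- dict with accel, vertical_rate, turn_rate
    Derivatives are evaluated at the current state only.
    """
    v, psi = state["v"], state["psi"]
    return {
        "v": v + dt * command["accel"],
        "N": state["N"] + dt * v * math.cos(psi),
        "E": state["E"] + dt * v * math.sin(psi),
        "z": state["z"] + dt * command["vertical_rate"],
        "psi": psi + dt * command["turn_rate"],
    }
```

Forward Euler uses only the derivative at the start of each step, which keeps the update cheap and deterministic at the cost of some integration error over long horizons.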

Reward Function Modification 

When using the reward function in eq. 1 to 
search, all trajectories not meeting the event constraint 
return a reward of $-\infty$ and so the algorithm cannot
distinguish amongst them. As a result, the algorithm 
is largely just using Monte Carlo to search for event 
trajectories. In reality, some of these non-event tra- 
jectories are much closer to an event occurrence than 
others. If the closeness can be quantified, we can 
greatly speed up the algorithm by modifying the 
reward function to incorporate this information. In 
particular, instead of returning $-\infty$ when the event
does not occur, we return a large negative reward 
correlated with how “close” the trajectory came to 
reaching an event. This modification has the effect of 
making the reward function more gradual and guiding 
the search to focus on areas near events. 

In the case of mid-air encounters and NMACs, 
an obvious closeness metric is the miss distance $D_{\text{miss}}$,
defined as the Euclidean distance between aircraft at 
their closest point. The miss distance is a good metric 
because it is monotonically decreasing as trajectories 
get closer to the event E and is minimum at E. We 
use this modified reward function for all our ACAS X 
experiments. 
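
One possible shape for such a modified terminal reward is sketched below. The additive form, the constant `penalty`, and the function names are assumptions; the text specifies only that the value should be large and negative and should improve as the miss distance shrinks.

```python
import math


def miss_distance(track_a, track_b):
    """Smallest Euclidean distance between two time-aligned 3-D tracks."""
    return min(math.dist(p, q) for p, q in zip(track_a, track_b))


def terminal_reward(event_occurred, d_miss, penalty=1.0e4):
    """Shaped terminal reward: 0 on an event, otherwise -(penalty + d_miss).

    The constant `penalty` and the additive form are assumptions; the only
    requirement taken from the text is a large negative value that improves
    as the miss distance shrinks.
    """
    if event_occurred:
        return 0.0
    return -(penalty + d_miss)
```

Under this sketch, two non-event trajectories are no longer indistinguishable: the one that passed closer receives the higher reward, which gives the search a gradient to follow toward events.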

Single Threat Encounters 

We performed studies on two-aircraft encounters 
with initial conditions sampled from the LLCEM. The 
algorithm searches the space of trajectories starting 
from the encounter’s initial conditions and returns the 
trajectory with the highest reward found. Although we 
have crafted the reward function to find NMACs, the 
algorithm may or may not have been successful at
finding one. A returned trajectory where an NMAC 
does not occur could mean either that the number of 
samples used is insufficient, or that one cannot be 
reached from the initial conditions. With the configu- 
ration shown in Table I, NMACs were found in 18% 
of analyzed encounters. We manually clustered the 
resulting trajectories and present our findings. 

Table I. Single Threat Study Configuration

Simulation
  Number of aircraft            2
  Encounter Model               LLCEM
  Sensors                       Active, Beacon-Based, Noiseless
  Collision Avoidance System    ACAS X 0.8.5
  Pilot Response Model          ICAO 5s-3s

MCTS
  depth                         50
  iterations                    2000
  exploration constant          100.0
  $k$                           0.5
  $\alpha$                      0.85
  $k'$                          1.0
  $\alpha'$                     0.0


Crossing Time: A number of NMACs resulted 
from well-timed vertical maneuvers. In particular, 
aircraft crossing in altitude during the delay period 
of an initial RA tends to be problematic. Figure 5 
shows one such encounter that eventually ends in 
an NMAC at 36 seconds. The probability density of 
this trajectory evaluated using LLCEM is 5.3 • 10 1 x . 
This quantity, which we will refer to as the likelihood 
metric, can be used as a relative measure of how likely
a trajectory is to occur. In this encounter, the aircraft
cross in altitude during pilot 1’s delay period. This
results in aircraft 1 starting the climb from below 
aircraft 2. Unfortunately after this has occurred, there 
is not enough time to recover using a reversal due to 
subsequent pilot response delays. 

High Turn Rates: Turns, especially those at
higher rates, tend to complicate the conflict resolution
process by quickly shortening the time to closest
approach. ACAS X does not have full state information
about its intruder and must estimate it by tracking
relative distance, relative angle, and the intruder altitude.
Figure 6 shows an example of an encounter that has
similar crossing behavior as Figure 5 but exacerbated
by the high turn rate of aircraft 2 (approximately
1.5 times the standard turn rate). In this scenario,
the aircraft become almost head-on at time of closest
approach and a reversal is not attempted. An NMAC
with a likelihood metric of $6.5 \cdot 10^{-17}$ occurs at 48
seconds.

Figure 5. NMAC trajectory where altitude crossing
occurs during pilot’s initial response delay.

Initially Moving Against RA: We found that a
number of NMACs were caused by the pilot initially
moving against the issued RA before proceeding to
fully comply with it. Recall that the pilot is following
the intended trajectory during the first 5 seconds
of an initial RA, and this intended trajectory may
disagree with the RA. In most cases, the disagreement
must be severe to cause an NMAC, where the
pilot aggressively accelerates against the RA. Perhaps
unsurprisingly, this makes it extremely difficult for
the collision avoidance system to resolve the conflict
despite full compliance later on.

Figure 6. NMAC trajectory where high turn rate
is a leading factor.

Sudden Aggressive Accelerations: Sudden maneuvers
can lead to NMACs when they are sufficiently
aggressive and occur at just the right time.
In particular, we observed some encounters where
two aircraft are approaching one another separated
in altitude and flying level, then one aircraft suddenly
accelerates vertically towards the other aircraft as they
are about to pass. Under these circumstances, given
the pilot response delays and dynamic limits of the
aircraft, the collision avoidance system often does not
have enough time to resolve the conflict.

Fortunately, the chances of such maneuvers are 
exaggerated in our modeling since ACAS X issues 
traffic alerts (TAs) to alert pilots to nearby traffic. 
These warnings typically occur well in advance of 
issued RAs to help keep pilots from entering situations
where RAs are needed. The pilot response model used 
in this work does not capture the effect of the TAs. 

Combination of Factors: In our experiments, it 
is very rare for an NMAC to occur due to a single 
cause. Typically a combination of factors contributes
to the eventual NMAC as shown in the example 
encounter in Figure 5. Although crossing time played 
a crucial role, there are a number of other factors 
that are important. The horizontal behavior where 
the aircraft are turning into each other is significant as it
reduces the time to closest approach. The two vertical 
maneuvers of aircraft 1 before receiving an RA are 
also important. Similar observations can be made on 
nearly all NMAC encounters found. 

Multi-Threat Encounters 

We performed studies on three-aircraft encounters
with initial conditions sampled from the Star
Encounter Model. With the configuration shown in
Table II, NMACs were found in 25% of analyzed
encounters.


Table II. Multi-Threat Study Configuration

Simulation
  Number of aircraft            3
  Encounter Model               Star Model
  Sensors                       Active, Beacon-Based, Noiseless
  Collision Avoidance System    ACAS X 0.8.5
  Pilot Response Model          ICAO 5s-3s

MCTS
  depth                         50
  iterations                    1000
  exploration constant          100.0
  $k$                           0.5
  $\alpha$                      0.85
  $k'$                          1.0
  $\alpha'$                     0.0


Pairwise Coordination in Multi-Threat: Our al- 
gorithm discovered a number of NMAC encounters 
where all aircraft are issued a “multi-threat” RA and 
asked to follow an identical climb rate. Unfortunately, 
compliance with the RA results in the aircraft even- 
tually closing horizontally without gaining vertical 
separation. We show an example of such an encounter 
in Figure 7, where an NMAC with a likelihood metric
of 5.8 × 10^-7 occurs at 38 seconds.

In discussing these results with the ACAS X 
development team, we learned that this behavior is 
a known issue that can arise from performing multi- 
aircraft coordination using a pairwise coordination 
mechanism. The pairwise coordination messages in 
essence determine which aircraft will climb and 
which will descend in an encounter. Since coordina- 
tion messaging occurs pairwise, under rare circum- 
stances it is possible for each aircraft to receive con- 
flicting coordination messages from the other aircraft 
in the scenario. In normal encounters, the aircraft that 
receives conflicting coordination messages from two 
aircraft remains level and lets the other aircraft climb 
or descend around it. Although uncommon, this is an 
important case that is being addressed by both TCAS 
and ACAS X development teams. 
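
The coordination conflict described above can be illustrated with a toy model. The sketch below is not the ACAS X or TCAS coordination logic; it assumes only that each pair of aircraft independently agrees on which of the two climbs and which descends. The names AIRCRAFT, PAIRS, senses_received, and conflicted are hypothetical and chosen for illustration. Enumerating every combination of pairwise agreements for three aircraft shows when an aircraft is handed both a climb and a descend sense.

```python
from itertools import combinations, product

# Toy model (an assumption for illustration, not the real coordination logic):
# every pair of aircraft independently picks one member to climb, and the
# other member of that pair descends.
AIRCRAFT = ("A1", "A2", "A3")
PAIRS = list(combinations(AIRCRAFT, 2))


def senses_received(assignment):
    """Map each aircraft to the set of senses it receives across its pairs.

    `assignment` maps each pair (a, b) to the aircraft chosen to climb.
    """
    received = {a: set() for a in AIRCRAFT}
    for (a, b), climber in assignment.items():
        descender = b if climber == a else a
        received[climber].add("climb")
        received[descender].add("descend")
    return received


def conflicted(assignment):
    """Return the aircraft that received both a climb and a descend sense."""
    return [a for a, s in senses_received(assignment).items() if len(s) > 1]


def all_assignments():
    """Yield every combination of pairwise climb/descend agreements."""
    for climbers in product(*PAIRS):
        yield dict(zip(PAIRS, climbers))


assignments = list(all_assignments())
# Cases in which every aircraft holds conflicting senses at once.
deadlocks = [a for a in assignments if len(conflicted(a)) == len(AIRCRAFT)]

if __name__ == "__main__":
    for a in assignments:
        print(a, "-> conflicted:", conflicted(a))
    print(len(deadlocks), "of", len(assignments), "assignments conflict for all aircraft")
```

In this toy model, most combinations leave exactly one aircraft with conflicting senses, which corresponds to the normal case where that aircraft remains level. The cyclic combinations, where each aircraft is told to climb relative to one neighbor and descend relative to the other, leave every aircraft conflicted at once.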

Maneuverable Space: In general, multi-threat 
encounters are more difficult to resolve than pair- 
wise encounters because there is typically less open 
space for the aircraft to maneuver. Figure 8 shows 
an example of an NMAC trajectory where aircraft 
1 (the aircraft in the middle altitude between 10 
and 36 seconds) needs to simultaneously avoid an 
aircraft below and a vertically closing aircraft from 
above. An NMAC with a likelihood metric of
1.0 × 10^-16 occurs at 39 seconds. Aircraft 2’s downward
maneuver greatly reduces aircraft 1’s maneuverable
airspace. Preventing aircraft 2 from maneuvering
downwards would likely require preemptive action by
the collision avoidance system occurring much earlier
in the encounter. Admittedly, this is an extremely 
challenging scenario that is somewhat unfair as a test 
case. Nevertheless, insight can be gained by looking 
closely at how the collision avoidance system handled 
the problem. In this case, ACAS X on aircraft 1 
decides to issue a crossing RA rather than to remain 
sandwiched. 
Figure 7. NMAC trajectory where aircraft are in
a coordination deadlock. 

Pairwise Phenomena: As one would expect,
phenomena that appear in pairwise encounters also
appear in multi-threat encounters, albeit typically
exacerbated by the presence of the third aircraft. In our
multi-threat analysis, we noted similar phenomena 
related to crossing time, initially moving against the 
RA, and sudden aggressive accelerations discussed 
previously. We did not observe the impact of high 
turn rates in the multi-threat case due to our use of 
the Star Encounter Model. 
Figure 8. NMAC trajectory where an aircraft must
avoid intruders from above and below. 


Performance 

We compared MCTS against a direct Monte 
Carlo search algorithm, “MCBEST,” given a fixed 
computational budget. To do this, we make a small 
modification to the MCTS algorithm to enable precise 
control of the computation time used. Instead of 
terminating based on number of iterations as we had 
previously done, we iterate until the allotted computa- 
tion time is spent. For MCBEST, we repeatedly draw 
Monte Carlo samples until the allotted computation 
time is spent, then return the trajectory with the 
highest total reward. 
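
The time-budgeted baseline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function mcbest, its parameters, and the stand-in sampler toy_sampler are hypothetical names, and the toy reward is an assumption that merely plays the role of a total trajectory reward returned by a simulator rollout.

```python
import random
import time


def mcbest(sample_trajectory, budget_seconds, rng=None):
    """Draw Monte Carlo samples until the time budget is spent.

    `sample_trajectory(rng)` must return a (trajectory, total_reward) pair.
    Returns the best trajectory, its reward, and the number of samples drawn.
    """
    rng = rng or random.Random()
    deadline = time.perf_counter() + budget_seconds
    best_traj, best_reward, n_samples = None, float("-inf"), 0
    # Always draw at least one sample so a result exists for tiny budgets.
    # The clock is checked between samples, so the loop can overrun the
    # budget by at most the duration of one sample.
    while n_samples == 0 or time.perf_counter() < deadline:
        traj, reward = sample_trajectory(rng)
        n_samples += 1
        if reward > best_reward:
            best_traj, best_reward = traj, reward
    return best_traj, best_reward, n_samples


def toy_sampler(rng):
    """Hypothetical stand-in for one simulator rollout of 50 steps.

    Each step draws a standard normal disturbance; the total reward is the
    sum of -x^2/2 terms, so less extreme disturbances score higher.
    """
    steps = [rng.gauss(0.0, 1.0) for _ in range(50)]
    reward = sum(-0.5 * x * x for x in steps)
    return steps, reward


if __name__ == "__main__":
    traj, reward, n = mcbest(toy_sampler, budget_seconds=0.1, rng=random.Random(0))
    print("samples drawn:", n, "best total reward:", reward)
```

The same deadline check could replace an iteration counter in a tree search loop, which is the kind of modification the comparison relies on to give both algorithms an equal computation time.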



Figure 9. A study of reward vs. computation time

Figure 10. Number of NMACs found vs. computation
time


Figure 9 compares the two algorithms by the 
total reward of the resulting trajectory. Each data 
point reflects 100 pairwise encounters and the mean 
and standard error of the mean are shown. Figure 10 
shows the number of NMACs found out of the 100 
encounters searched. In both cases, MCTS clearly 
performs better than the baseline Monte Carlo search. 
The effectiveness of MCTS in finding NMACs is of 
particular importance, and we see that MCTS greatly 
outperforms direct Monte Carlo search in this regard. 
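
For reference, the mean and standard error of the mean reported for each data point can be computed as in the sketch below. The helper name mean_and_sem and the sample values are illustrative assumptions, not data from the study; the standard error is taken as the sample standard deviation divided by the square root of the number of encounters.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and the standard error of the mean of `values`.

    Uses the sample standard deviation (n - 1 in the denominator).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the standard error")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


if __name__ == "__main__":
    # Made-up total rewards standing in for one data point's encounters.
    rewards = [-120.5, -98.2, -143.7, -110.0, -105.3]
    mean, sem = mean_and_sem(rewards)
    print("mean:", mean, "standard error:", sem)
```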


Conclusion and Future Work 

Air travel is one of the safest forms of transporta- 
tion available today due to safety systems like TCAS. 
ACAS X seeks to make air travel even safer by further 
reducing collision risk while simultaneously reducing 
the number of unnecessary alerts. To achieve this, the 
design, development, and testing of ACAS X all use 
state-of-the-art techniques. 

In this paper, we have proposed a novel approach 
to stress testing black box systems and demonstrated 
its utility in stress testing ACAS X. Since the rein- 
forcement learning method is very general, we hope 
to apply it to a broad range of domains. To facilitate 
this, we plan to release an open source Julia package 
of our implementation in the near future. 

There are several areas of further work. We
would like to leverage information in the problem that
is already available but currently not utilized to try to
speed up the search. Examples of such information
include the distribution of rewards collected from 
rollouts as well as the state-action value estimates of 
one action relative to another. Another extension is 
to leverage the existing search tree to estimate other 
useful information such as the overall probability of 
an event. A further interest is in leveraging the search 
tree to cluster trajectories so that a compact set of 
representative NMAC examples can be produced. 